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FINAL  REPORT 


This  is  the  final  report  of  a  study  of  special  computing  machinery 
for  computations  on  sparse  matrices,  under  contract  with  the  Office  of 
Naval  Research  N000lU-79-C-0^77 

The  elements  of  the  computing  system  and  its  application  to  Gaussian 
elimination  are  described  in  some  detail  in  the  paper  "On  a  Special 
Purpose  Matrix  Array-Processor"  by  Philip  N.  Armstrong  and  David  G-  Cantor, 
which  has  been  submitted  for  publication  and  comprises  the  major  part  of 
this  report.  This  paper  describes  the  system  and  the  algorithms  for  which 
it  is  designed,  so  that  a  reader  who  is  acquainted  with  computations  on 
matrices  will  understand  the  system's  utility  for  such  computations. 

Some  additional  remarks  regarding  selected  matrix  computations  may  be 
useful  and  are  included  here,  although  what  is  written  below  may  be 
inferred  from  the  substance  of  the  paper.  Reading  the  paper  is  a  pre¬ 
requisite  for  these  remarks. 


1.  Matrix  transformations  of  a  vector:  computing  Ax  and  Ax. 

Assume  that  A  is  an  n  x  n  matrix  and  that  column  a  .  of  A  is 
stored  in  module  M.,  i-e.  A  is  treated  as  though  it  were  dense.  If 

t) 

the  j  element  of  x  is  in  processor  P.,  module  M.,  computation  of 

J  J 

T 

either  Ax  or  A  x  will  require  n  ticks.  More  generally,  if  A  has 
m  rows  and  n  columns,  stored  in  the  above  manner,  max(m,n)  ticks  will 
be  required. 


2.  Forward  and  back  substitution:  y  =  L  ^b  and  x  =  U  ^y. 

Suppose  L  is  an  n  x  n  lower  triangular  matrix,  U  is  an  upper 
triangular  matrix,  and  b  is  a  vector  with  n  components.  The  computa- 
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tions  L  d  and  U  y  will  each  require  n  ticks.  The  n  entries  of 

L  must  be  accessible  in  order  of  increasing  row  indices;  the  entires  of  U 

must  be  accessible  in  order  of  decreasing  row  indices. 

3.  Solving  the  system  Ax  =  b,  where  A  is  an  n  x  n  invertible  matrix. 
The  entries  of  x  may  be  obtained  by  Gaussian  elimination  with 

partial  pivoting  in  n  +  3n  ticks,  assuming  fill-in  does  not  exceed 

2 

the  memory  capacity  of  the  system.  After  the  first  n  ticks,  the  vector 
Fb  may  be  stored,  where  P  is  the  permutation  matrix  defined  by  the 
partial  pivoting  operations,  so  that  in  n^  +•  n  ticks  the  LU  fonn  will 
be  completed  and  the  n  components  of  Pb  will  be  in  place.  The  systems 
Ly  =  Fb  and  Ux  =  y  may  then  be  solved  in  the  following  2n  ticks.  This 
amount  of  time  may  be  decreased  if  A  is  sparse.  For  details,  see  page  20 
of  the  paper. 

4.  Updating  the  LU  form  of  an  n  x  n  matrix. 

Suppose  the  LU  form  of  a  matrix  has  been  obtained  by  Gaussian  elimi¬ 
nation  with  partial  pivoting,  using  the  array-processor.  Then  suppose 
the  matrix  A  is  changed  to  a  new  matrix  A  by  the  exchange  of  a  column 

a  .  for  a  new  column  a  ..  This  is  the  "pivot  step"  of  the  Simplex  algo- 
*<] 

rithm  for  linear  programming,  as  described  in  the  paper  "Numerical 
Techniques  in  Mathematical  Programming"  by  R.H.  Bartels,  G.H.  Golub,  and 
M-A-  Saunders,  in  the  book  Nonlinear  Programming,  Academic  Press,  New  York, 
1970.  This  computation  requires  no  more  than  5n  -  j  +  1  ticks,  using 
the  method  described  in  the  paper.  In  this  method,  an  upper  Hessenberg 
matrix  is  reduced  to  its  LU  form  and  the  elements  of  the  lower  triangular 
matrix  which  is  produced  are  stored,  using  no  more  storage  than  that 
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allocated  for  one  column  of  the  original  matrix  A.  Similar  computations 
will  produce  matrices  factored  in  this  way  with  new  column  interchanges. 
With  each  subsequent  decomposition,  the  storage  will  be  similarly 
increased,  and  the  amount  of  time  consumed  by  the  procedure  will  increase 
by  one  tick.  Periodically,  recomputation  of  the  LU  form  will  be  per¬ 
formed  (as  is  customary  in  applications  of  the  Simplex  algorithm). 

5.  Householder  transformations . 

If  A  is  an  m  x  n  matrix  stored  by  column  in  the  modules  so  that 

each  column  uses  one  module  of  the  array,  A  may  be  reduced  to  a  matrix 

R  by  an  orthogonal  matrix  Q,,  which  is  the  product  of  Householder 

T 

transformations  of  the  form  I  -  Suppose  A^  is  produced  by 

k  such  transformations,  starting  with  matrix  A  (=  AQ).  The  first  step 

in  computing  A^  is  to  scan  the  last  n  -  k  +  1  columns  of  Ak-1  and 

compute  the  lengths  of  these  columns.  The  column  of  greatest  length  will 

t  h 

be  selected  and  moved  to  the  k  column  position.  This  selection  and 
movement  of  the  column  will  consume  2n  +  j  -  k  ticks,  where  a  .  is  the 
selected  column.  There  will  then  remain  2(n-k)  ticks  for  the  transfor¬ 
mation  itself.  When  the  entries  of  several  columns  are  in  a  single  module, 
the  selection  of  the  column  of  greatest  length  may  be  done  by  scanning 
each  column  twice  so  that  the  columns  may  be  sorted  by  column  length, 
after  which  the  requisite  column  permutation  and  arithmetic  computations 
may  be  done.  This  selection  procedure  illustrates  the  utility  of  the  flag 
bits  and  the  facilities  within  the  processors  for  changing  the  keys  over 
which  the  words  are  sorted. 


V, 
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I...  Permutations  of  rows  and.  columns. 

Permutations  on  the  rows  and  columns  of  matrixes  to  reduce  fill-in  or 
to  increase  numerical  stability  are  discussed  in  the  literature. 

It  suffices  to  note  that  all  standard  permutations  can  be  performed. 
For  example,  it  is  often  convenient  to  arrange  the  columns  of  a  matrix  so 
that  the  column  indices  of  the  first  non-zero  row  entries  are  in  increasing 
order.  This  requires  a  perliminary  scan  of  the  matrix,  of  course.  Other 
procedures,  e.g.  formation  of  the  transitive  closure  of  the  graph  repre¬ 
sented  by  the  non-zero  entries  of  the  matrix  and  determining  whether  this 
matrix  is  symmetric  are  easily  performed  in  this  system. 


7.  Computations  which  require  augmentation  of  the  system. 

As  may  be  seen  from  (l)  of  this  report,  no  extra  time  is  required  for 

T 

forming  the  transpose  of  a  matrix  when  computing  A  x  instead  of  Ax. 

There  are,  however,  computations  such  as  the  one  above  in  determining  the 
symmetry  of  the  transitive  closure,  or  two-dimensional  Fourier  transforms, 
for  which  the  transpose  of  a  matrix  is  either  required,  or  at  least  useful. 
Thus,  it  may  be  desirable  to  provide  data  paths  to  expedite  this  procedure. 


If  module  M.  is  connected  to  those  modules  M.  for  which  i  -  j  is  a 
1  J  0 

power  of  2,  an  additional  (approximately)  log^n  data  paths  will  be 


connected  to  each  mocule,  where  n  is  the  total  number  of  modules  in  the 


array.  This  provision  makes  the  computation  of  the  two-dimensional 
Fourier  transform  on  n"~  entries  in  approximately  2n  log^n  ticks 
feasible  --  of  there  are  at  least  n  modules  in  the  system. 


It  appears  that  many  other  computations  may  be  made  feasible,  or 
expedited,  if  more  than  one  SSM  is  provided  at  each  module.  An  example 
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of  this  use  of  storage  is  given  on  page  20  of  the  paper.  This  matter 
remains  to  be  examined,  along  with  the  improvement  in  system  performance 
to  be  expected  from  additional  memory  elements. 

Philip  N.  Armstrong 
December  15 ,  1980 
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On  a  special  purpose  matrix  array-processor 
Philip  N.  Armstrong1  and  David  G.  Cantor^ 

The  advent  of  very  large  scale  integrated  circuitry  (VLSI) 
has  increased  the  interest  in  special  purpose  array-processors, 
which  perform  certain  algorithms  extremely  rapidly,  for  use 
as  peripheral  devices  on  computers.  To  be  economical,  these 
processors  should  be  built  from  many  copies  of  a  few  simple 
devices,  with  interconnections  kept  to  a  minimum.  They  should 
allow  for  parallel  processing  and  pipelining.  See,  for  example, 
Jones  and  Schwartz  (and  the  references  cited  therein)  [5]. 

The  purpose  of  this  paper  is  to  present  a  novel  design  for 
a  matrix  array-processor.  Such  designs  have  been  presented 
before.  See  for  example  "Systolic  arrays  (for  VLSI)"  by  H.T .  Kung 
and  Charles  E.  Leiserson  [7]  or  Chapter  8  of  Mead  and  Conway  [8]. 
For  a  general  survey  of  parallel  matrix  computation  schemes,  see 
Heller  (and  the  references  cited  therein)  [3], 

Our  proposed  system  is  radically  simpler,  and  correspondingly 

slower,  than  that  of  Kung  and  Leiserson.  For  example  to  invert 

•  2 
a  dense  n  by  n  matrix,  their  system  would  require  n  processors 

and  take  n  basic  units  of  time,  while  ours  has  no  particular 

requirement  on  the  number  of  processors,  but  with  n  (or  more) 

2 

processors  takes  n  units  of  time. 

However,  our  system  requires  far  fewer  interconnections, 
thus  allowing  its  extension  onto  numerous  VLSI  chips  (as  and  if 
necessary).  In  addition  it  is  able  to  perform  all  standard 
matrix  operations  including  partial  pivoting  during  Gaussian 

1.  Supported  in  part  by  ONR  Contract  N0014-79-C-0477 

2.  Supported  in  part  by  NSF  Grant  MCS-79-03711 
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elimination  as  well  as  updating  of  the  LU  form  of  a  basis,  which 
is  required,  for  example,  during  use  of  the  (revised  form  of 
the)  simplex  algorithm  (see  Orchard-Hays  [9]  and  Tomlin  [12]). 
The  proposed  system  can  efficiently  perform  both  Givens  and 
Householder  transformations  in  parallel  [4 ],  preliminary  to  com¬ 
putation  of  eigenvalues  and  eigenvectors  of  (not  necessarily 
symmetric)  matrices.  It  takes  advantage  of  sparse  matrices,  even 
if  irregular  (i.e.  not  necessarily  band-limited  matrices). 
Finally,  our  system  can  efficiently  perform  the  Fast  Fourier 
Transform  and  provides  an  excellent  sorting  mechanism  for  lists 
of  numbers.  We  give  examples  of  how  some  of  the  above  are  per- 
f  ormed . 

The  major  novelty  of  our  system  is  a  non-standard  memory, 
not  addressable  (in  the  usual  sense)  and  thus  not  requiring  the 
extensive  addressing  logic  (and  associated  connections)  used  in 
ordinary  computer  memory,  together  with  appropriate  formulations 
of  the  standard  algorithms  which  work  with  this  type  of  memory. 
The  memory  provides  a  type  of  automatic  list-processing. 

1 .  The  memory 

The  memory  consists  of  independent  modules,  each  a  "self- 
sorting"  memory  (SSM),  as  described  by  Armstrong  and  Rem  in  [ 1] . 
A  simplified  version  of  the  memory  is  illustrated  in  figure  1. 

It  consists  of  two  sequences  of  binary  shift  registers 
UltU2,...,UN  and  Lj ,L2 , . . . ,L  .  All  of  the  IT  hold  u>  1 
bits  and  all  of  hold  f.  >  1  bits;  w  =  u  +  Z  is  called 

the  word  length  of  the  memory  (typical  values  for  w  range 
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from  50  to  200).  The  IT  and  are  interconnected  by 

logical  units  CT  called  sorters  [6].  A  sorter  can  be  in 
one  of  three  states:  U  (undetermined),  S  (straight-through) 
or  N  (normal).  There  are  (simple)  controls  on  the  memory 
to  place  each  of  its  sorters  into  the  state  U  at  a  specified 
time . 

The  sorter  has  two  inputs:  the  left  one  is  called  the  low 
input  and  the  right  one  is  called  the  high  input.  Similarly 
there  are  two  outputs:  the  left  one  is  called  the  low  output 
and  the  right  one  is  called  the  high  input.  In  state  N  the 
low  input  is  connected  to  the  low  output  and  the  high  input  is 
connected  to  the  high  output.  In  state  S,  this  is  reversed, 
and  the  low  input  is  connected  to  the  high  output  and  the  high 
input  is  connected  to  the  low  output.  See  figure  2. 

The  shift  registers  all  operate  in  synchronism,  shifting 
one  bit  each  bit  time.  If  a  sorter  is  in  state  U  and  the  two 
inputs  are  the  same  it  will  remain  in  state  U  and  transmit 
(without  delay)  the  (common)  value  of  the  inputs  to  both  outputs. 

If  however  the  low  input  receives  a  0  and  the  high  input  receives 
a  1,  then  the  sorter  goes  into  state  N,  while  if  the  low  input 
receives  a  1  and  the  high  input  receives  a  0  the  sorter  goes 
into  state  S.  Once  in  state  N  or  S,  the  sorter  remains  in 
that  state  until  a  control  signal  changes  it. 

The  upper  left  input  of  the  entire  module  is  (similarly) 
called  the  module’s  low  input  and  the  lower  right  input  is  called 
the  high  input.  Similarly  the  lower  left  and  upper  right  outputs 
are  called,  respectively,  the  low  output  and  the  high  output. 
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The  period  of  this  memory  is  w  (bit  times).  We  assign  a 
start  time  of  Sj  =  ju  (reduced  modulo  w)  to  the  sorter  Cj, 
0  <  j  <  N. 

During  normal  operation,  the  controls  put  sorter  Cj  into 

state  U  at  the  beginning  of  bit  time  s j .  (This  is  done  for 

all  sorters,  once  each  period.  For  the  customary  values  of  u 

(1,  w/2,  or  w-1),  implementation  is  extremely  simple. 

The  registers  Uj  and  Lj  together  form  a  ring  Rj. 

Each  ring  Rj  holds  w  bits. 

We  shall  say  that  the  sorter  is  stable  if  (1)  during  one 

period  all  sorters  stay  in  state  U  or  go  into  state  N; 

( 2 )  zeros  are  fed  into  the  low  input  and  ones  into  the  high 

input  of  the  module;  (3)  sorter  Cj  is  set  to  state  U  at 

time  Sj.  When  that  is  so,  each  ring  Rj  contains  w  bits. 

The  bit  entering  Cj  at  time  Sj  (equivalently  C j_1  at  time 

Sj_1)  is  called  bit  bQ  (or  bQj  if  we  wish  to  identify  the 

ring) .  Successive  bits,  proceeding  counter-clockwise  around 

the  ring,  are  called  bits  b.^ , b£  ,  .  .  .  , bw_^  .  Thus  the  w  bits 

w-1 

in  ring  R.  represent  a  binary  number  bnb?...b  ,  (=  Z  b.2w 

j  m  l  i=Q  1 

in  decimal);  bQ  is  the  high-order  (most  significant)  bit, 
b^  is  the  next  most  significant  bit,  ...  ,  and  finally  bw_^ 
is  the  low  order  (least  significant  bit).  Denote  the  word 

th 

bn.b-  .  •••  b  ,  •  by  B.;  it  is  the  word  cycling  in  the  j 

UJ-LJ  "“I  )  J  J 

ring.  If  the  words  Bj  and  Bj+1  are  equal  then  sorter  Cj 

will  remain  in  state  U.  If  they  are  not  equal,  then  there 

t  h 

must  be  a  first  bit,  say  the  k"  ,  in  which  they  differ.  Then 
bi,  j-l  =  bi,  j  for  0<L<K~1  but  /  bk>  .. 


By  our 
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assumption  of  stability,  when  these  bits  are  encountered  C  j 
goes  into  state  N.  This  implies  that  b  ,  =  0  and  b.  .  =  1. 

K,J-i  K-,  J 

It  follows  that  Bj_i  <  Bj*  Thus  we  see  that  during  stable 
operation,  the  words  Bj  satisfy  —  B2  —  ***  —  • 

Let  us  see  what  happens  if  a  word  Bg  =  ^00bl0 * ’ " ^w-1  0 
is  entered  into  the  low  input,  replacing  the  zeros  customarily 
entering  there,  with  bgg  entering  at  time  Sg.  If  Bg  <  B^ 
it  is  clear  that  Bg  will  be  output  immediately  from  the  low 
output.  If  however  Bg  >  B^,  then  B^  will  be  output  immedi¬ 
ately  and  Bg  will  be  sent  to  C2>  where  it  will  be  compared 
with  B2 .  If  Bg  <  B2  then  Bg  will  remain  in  ring  , 

otherwise  it  will  be  sent  to  Cg  and  compared  to  B3»  This 
will  proceed  until  a  stable  situation  is  reached.  At  this  time 
the  memory  will  contain  numbers  B^  <  B^  <  •••  <  B^ .  Thus  we 
observe  the  basic  operation  of  this  memory,  which  is  to  insert  a 
number  BQ  into  the  low  input  and  obtain  from  the  low  output 
the  smallest  number  among  Bg,B^ , . . . ,B  ^ .  Since  this  is  a 
"pipe- lined"  memory,  the  basic  operation  can  be  repeated  every 
period,  even  though  the  memory  is  not  stable. 

Similarly  if  a  number  is  entered  into  the  high 

input,  the  larger  number  among  B^  bn+i  will  be  output  from 
the  high  output,  and  the  remaining  numbers  will  move  into 
increasing  order. 

Note  that  while  one  can  successively  enter  numbers  from 


the  low  input,  once  each  period,  or  from  the  high  input,  once 
each  period,  in  order  to  switch  from  one  input  to  the  other,  it 
is  necessary  for  the  "self-sorting"  memory  to  stabilize,  and 
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this  may  take  as  many  as  N  periods. 

In  order  to  avoid  this  lengthy  delay,  we  use  the  slightly 
more  complicated  memory  shown  in  figure  3.  The  elements 
Dq»D1,  .  .  .  ,D  are  sorters,  while  the  elements  vi  »V2  ’*  '  *  ,VN-1 

are  shift  registers,  each  holding  u  bits.  The  sorters  Dj 
are  drawn  upside  down  (compared  with  the  C^);  thus  the  lower 
left  input  is  the  low  input  and  the  upper  left  output  is  the  low 
output.  Similarly  the  other  input  and  output  are  the  high  input 
and  output,  respectively.  Like  Cj,  D.  is  put  into  state  U 
at  t ime  s  . . 

During  stable  operation,  the  Uj,  Lj  and  Cj  function 

as  before.  Ring  R  .  contains  word  W  .  and  WL  <  W0  <  •••  <  W., . 

j  ]  1—  2—  —  N 

The  \L  contain  ones.  If  a  number  WQ  is  entered  into  the 
low  input  the  system  behaves  as  before.  If  a  number  WQ  is 
entered  into  the  high  input,  then  it  will  travel  along  the  path 
^  1 ,  V  2  >  *  ’  ’  » V\  - 1  •  anY  of  numbers  in  the  rings  R1,R2»-.*»RN 

is  greater  than  Wq,  it  will  be  compared  with  WQ  at  some  stage 
and  replace  Wq.  Thus  (N-l)u  bit  times  after  the  insertion  of 
Wq,  the  largest  of  the  numbers  wo,'^l,’*',^N  will  emerge  from 
the  high  output  (for  our  purposes,  this  delay  is  unimportant). 

Finally  we  will  need  one  more  control.  This  will  connect 
to  all  sorters  Cj  and  Dj.  It  will  enable,  for  given  k, 
replacement  of  the  k^^1  bit  of  each  word,  as  it  enters  the 
sorter,  by  0  (this  will  only  be  done  for  2  <  k  <  w,  and 
usually  for  2  <  k  <  6). 

Before  explaining  how  this  self -sorting  memory  is  used, 
we  note  that  if,  for  purpuses  of  VLSI  implementation,  the  connec- 
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tion  from  UN  to  is  made  external  to  the  chip  containing 

the  module,  and  the  two  control  signals  (the  first  control 
signal  puts  sorter  Cj  and  into  state  U  at  time  s^; 

the  second  sets  the  bit  of  each  word  of  the  memory  to  0) 

are  passed  both  in  and  out  of  the  chip,  self-sorting  memories 
can  be  strung  together,  one  after  another,  and  chip  limitations 
do  not  limit  the  size  of  the  memory,  nor  do  time  delays  arising 
from  the  finite  speed  of  pulse  transmission).  In  addition, 
there  are  rei  ively  simple  modifications,  which  allow  what  is, 
in  essence,  bit  parallel  operation  of  these  memories,  and  which 
for  a  given  clock  speed,  allow  a  much  greater  memory  speed.  We 
do  not  describe  these  here,  although  they  might  be  used  in  an 
actual  implementation. 

2 .  The  array-processor 

The  array-processor  consists  of  a  sequence  of  M  identical 
modules  as  shown  in  figure  4.  Each  module  consists  of  a  pro¬ 
cessor  (P)  and  attached  self-sorting  memory  holding  N  words. 
Each  SSM  connects  only  to  its  processor,  and  its  lowest  ring 
is  an  addressable  register  in  its  processor.  Each  processor, 
except  those  at  the  end,  are  connected  by  bi-directional  lines 
to  each  of  the  eight  adjacent  processors  (four  on  each  side). 

For  simplicity  these  connections  are  not  shown  on  the  diagram. 

In  addition,  there  are  two  common  buses  to  which  all  processors 
have  access.  Attached  to  one  of  these  buses  is  one  reciprocal 
unit  (which  given  x  /  0,  returns  l/x)  and  one  square-root 
unit  (which  given  x  >  0  returns  i/x) .  The  square  root  unit 
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and  reciprocal  unit  may  be  physically  the  same.  The  processors 
are  numbered  Pi»P2 *  *  *  *  * pm‘ 

The  processors  have  limited  capability.  They  have  a  few 
registers  for  holding  floating  point  numbers.  They  can  add  and 
multiply  both  fixed  and  floating  point  numbers  (division  and 
square  root  are  not  necessary),  compare  floating  point  numbers, 
and  perform  limited  bit-manipulation  operations.  They  have  a 
small  program-store  (which  can  be  in  ROM,  or  in  RAM  which  is 
down-loaded  from  the  controlling  computer).  It  is  convenient 
(for  the  moment)  to  assume  the  processors  operate  in  a  synchronous 
mode.  We  shall  call  the  time  necessary  to  perform  one  period  of 
the  self-sorting  module  (w  bit  times)  followed  by  a  floating 
point  operation  of  the  complexity  a,b,c-*a  +  b*c  (or  similar 
operation)  a  tick.  We  assume  that  during  the  first  part  of  the 
tick  (when  the  processor  is  communicating  with  its  self-sorting 
memory)  a  processor  can  also  send  or  receive  (up  to  three)  words 
of  w  bits  from  its  adjacent  processors  (i.e.,  interprocessor 
communication  time  is  no  more  than  one  self-sorting  module 
period).  Later  we  shall  show  that  these  restrictions  are  not 
all  necessary,  and  that  the  array-processor  can  take  advantage 
of  faster  inter-processor  communication  (especially  when  working 
with  sparse  matrices) .  It  is  convenient  to  assume  that  the 
reciprocal  unit  and  the  square-root  unit  also  function  (including 
communication)  in  one  tick.  However,  if  this  is  not  so,  the 
entire  processor  will  function  only  slightly  slower.  For 
example,  in  Gaussian  elimination  of  an  n  by  n  matrix,  n  divi¬ 
sions  are  required,  and  it  is  the  extra  time  for  these  n  divi- 


sions  that  would  have  to  be  included. 

As  we  shall  see,  an  arbitrary  (invertible)  dense  n  X  n 

2 

matrix  can  be  inverted  in  n  ticks  (assuming  our  array  contains 
at  least  n  +  2  processors).  Thus  if  a  tick  takes  10  |is,  then 
a  500  by  500  dense  matrix  can  be  inverted  in  2.5  seconds.  We 
should  also  note,  in  this  regard,  that  the  primary  limitation, 
on  the  size  of  matrices  that  can  be  handled,  is  that  the  total 
memory  capacity  (of  the  self-sorting  memories)  suffices  to  hold 
the  original  matrix  and  all  intervening  matrices  occurring  during 
the  computation  (in  general,  each  self-sorting  memory  will  hold 
1  or  more  columns  of  a  matrix,  but  a  column  must  be  entirely 
contained  within  one  self-sorting  memory) . 

3 .  Data  manipulation 

We  shall  assume  throughout  the  remainder  of  this  paper  that 
all  data  items  have  a  leading  0,  i.e.  bQ  =  0.  We  shall  call 
any  word  with  a  leading  1  (bQ  =  1)  a  blank.  To  the  self- 
sorting  memory  (SSM),  blanks  are  larger  than  any  valid  datum, 
hence,  during  computations,  each  SSM  will  contain  valid  data 
followed  by  blanks. 

Note  that  a  processor  can  perform  the  following  three  opera¬ 
tions  on  its  attached  SSM. 

1)  it  can  delete  the  least  item  (which  will  be  transferred 
into  a  register  of  that  processor)  by  inserting  a  blank  into  the 
low  input  of  the  SSM; 

2)  it  can  add  an  item  into  the  SSM  (displacing  a  blank 

if  any  are  present,  otherwise  displacing  the  largest  datum,  which 
will  be  lost)  by  inserting  the  item  into  the  high  input; 
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3)  it  can  enter  a  new  item  into  the  low  input;  the  SSM 
will  simultaneously  write  out  the  smallest  of  the  valid  data 
items  (including  the  new  one)  present. 

Let  us  first  see  how  the  array-processor  sorts  a  list  of 
L  <  MN  numbers  in  2L  +  N  ticks.  We  first  (simultaneously) 
load  all  of  the  SSM  with  blanks.  This  will  take  N  ticks 
using  step  (1)  above  (with  a  slightly  more  elaborate  SSM  control, 
this  could  be  done  in  1  tick). 

The  data  items  are  sent  one-by-one  into  SSM  1 ,  using  step 
(2)  until  N  items  have  been  entered.  Then  it  will  continue 
inserting  items  using  step(3),  the  displaced  items  will  be  sent 
by  SSM  2,  where  they  will  be  inserted  as  before,  then  to  SSM  3, 
etc.,  until  SSM  k  (k<M)  is  partially  filled.  Then,  starting 
with  SSM  k,  they  will  be  written  out,  one-by-one,  using  step 
(1).  The  items  will  then  appear  in  ascending  sorted  order.  If 
the  items  to  be  sorted  are  keys  k^  from  larger  records  r^ , 
suppose  p^  is  a  pointer  to  r^  and  form  the  word  =  (k^,p^) 

(i.e.  the  pointer  p^  is  concatenated  to  the  key  k^).  As  long 
as  the  total  number  of  bits  in  k^  and  p^  is  less  than  w, 
the  keys  and  pointers  may  be  sorted  together.  If  descending 
order  of  sort  is  desired,  processor  P^  may  be  programmed  to 
take  the  ones  complement  of  the  x^  as  they  enter  and  recomple¬ 
ment  when  they  leave.  Other  preliminary  transformations  may  be 
necessary  if  the  k^  represent  signed-magnitude  or  twos -complement 
numbers.  Such  transformations  can  also  be  performed  in  P^. 

We  now  show  how  a  single  processor-SSM  combination  can 
perform  data  manipulation  on  vectors.  We  describe  a  slightly 
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simplified  version  of  our  procedure.  We  represent  a  vector 
entry  as  follows. 


0 

Lu 

Index  ( I ) 

Value  (V) 

Here  the  leading  0  is  that  required  of  all  valid  data;  F 
denotes  a  one  bit  "flag"  field;  I  is  the  subscript  or  index  of 
the  given  vector  entry;  finally  V  is  its  numerical  value 
(probably  in  floating  point).  Note  that  the  flag  and  index  are 
the  high  order  part  of  the  word,  not  its  value,  and  thus  the  SSM 
will  sort  on  indices,  not  on  the  value  (as  we  shall  see,  in 
Gaussian  elimination,  the  pivot  entry  is  determined  by  compari¬ 
sons  within  the  processor,  not  by  the  SSM).  If  V  =  0,  then 
the  word  is  not  stored  in  the  SSM.  This  last  rule  enables  us  to 
take  advantage  of  sparse  vectors.  In  practice,  I  might  be  a 
20  bit  field  and  V  an  80  bit  field. 

Let  us  suppose  that  an  SSM  contains  all  the  (non-zero) 
elements  of  a  vector,  and  that  we  wish  to  read  them  out  one-by- 
one,  modify  them,  and  then  replace  them.  We  assume  all  the  flag 
bits  are  0  (there  is  a  control  pulse  in  the  SSM  to  set  them 
to  0,  if  necessary) . 

Using  operation  (1)  we  move  the  first  element  of  the  SSM 
into  the  processor.  After  we  modify  it,  we  set  its  flag  bit  to 
1  and  use  operation  (3)  to  replace  it,  and  get  the  next  element. 
Since  the  first  element  is  large,  it  will  "bubble"  to  the  top  of 
the  SSM.  We  repeat  this  operation  until  we  have  passed  throuqh 
all  valid  data  items.  The  latter  will  be  signalled,  either  by 
a  count,  known  in  advance,  or  by  the  appearance  of  an  element 
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with  flag  bit  1.  At  that  point,  we  complete  the  operation  by 

setting  all  flag  bits  to  0.  To  interchange  two  elements,  say 

the  i^^1  and  the  j*"*1,  we  simply  replace  the  index  i  by  the 

til 

index  j  in  the  i  element,  when  it  appears,  and  replace  the 

th 

index  j  by  the  index  i  when  the  j  word  appears. 

We  note  that,  in  a  similar  fashion,  the  single  processor- 
combination  can  perform  the  "perfect  shuffle"  of  Stone  [11], 
which  is  used  in  the  Fast  Fourier  Transform  (and  numerous  other 
algorithms).  Recall  that  in  order  to  perform  the  perfect  shuffle, 
it  is  necessary  to  have  n  =  2  items,  and  that  item  j, 

0  <  j  <  n,  is  sent  to  item  2j  if  2j  <  n  and  to  item 
2j+l-  n  if  2j  >  n  (this  amounts  to  a  circular  left  shift  by 
1  of  the  index) .  The  flag  bit  is  used  as  before. 

In  order  to  perform  an  FFT,  all  of  the  n  roots  of 
unity  are  needed,  in  an  appropriate  order.  It  is  not  hard  to 
see  that  two  (since  they  are  complex)  SSM's  could  be  used  to 
hold  the  appropriate  roots  of  unity.  Thus  the  array  processor 
could  perform  [K/2J  -  2  simultaneous  FFT’s  in  2kn  ticks. 

We  note  finally,  that  any  permutation  it  of  a  vector  for 
which  Tr(i)  is  an  "easily  computable"  function  of  i  can  be 
performed  in  the  SSM. 

By  using  two  adjacent  processor-SSM * s  we  can  apply  an 
arbitrary  permutation  r  to  the  indices  of  a  vector.  One  SSM 
holds  words  with  index  field  i  and  data  field  r(i)  (if 
r  ( i )  =  i,  the  word  need  not  be  stored).  The  vector  entries 
and  permutations  are  read  out,  and  i  is  replaced  by  r(i)  in 
the  vector  entry,  the  flag  bits  are  set  to  1,  and  the  process 
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continues  • 

We  note,  finally,  that  many  of  the  algorithms  will  require 
more  than  one  flag  bit,  and  we  usually  will  reserve  three  flag 
bits  • 

4 .  Gaussian  Elimination 

The  purpose  of  this  section  is  to  describe  the  algoritb n 
for  performing  Gaussian  elimination,  with  partial  pivoting,  on 
dense  matrices,  using  the  array- processor .  While  basically 
very  simple,  the  algorithm  because  of  its  non-standard  order  of 
computation  and  parallel  computations  is  somewhat  difficult  to 
describe.  Let  us  recall  the  standard  Gaussian  elimination  for 
transforming  an  n  by  n  matrix  A  to  LU  form  [4].  We  wish 
to  write  A  =  LU  where  L  is  a  lower  diagonal  matrix  with 
ones  on  the  diagonal  and  U  is  an  upper  diagonal  matrix.  We 
define  a  sequence  of  matrices  A  =  (a^j  ),  0  <  i,j,k  <  n, 

with  A^0^  =  A  and  then  successively  for  k  =  l,2,...,n,  A^ 
is  defined  by 


(1) 

■i? 

-  Jk-D 

"  aiJ 

//aK.k^  *  k  +  1  <  i  <  n 

(2) 

a!k> 
1  J 

- 

**■  a  •  • 

i  j 

_  (k-1)  (k-1)  /  (k) 

ik  kj  /akk 

-  Jx-U 

-  a  «  • 

i  j 

-  a<i,4^1)  ,  k  +  1  < 

ik  kj  — 

(3) 

a(k) 

akj 

-  Jk-D 

~  akj 

otherwise  . 

( n ) 

Then  A^  =(L-I)+U;  i.e.  the  elements  below  the  main 
( n ) 

diagonal  of  A'  equal  the  corresponding  elements  of  L  (of 
course  the  diagonal  elements  of  L  are  1),  while  the  elements 
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on  or  above  the  main  diagonal  equal  the  corresponding  elements 
of  U.  The  above  algorithm  somewhat  rashly  assumes  that  the 
"pivot  elements"  a^  ,  1  <  k  <  n,  are  not  zero  (and  to  avoid 

numerical  instabilities  not  too  "small").  We  shall  later  drop 
this  assumption,  but  continue  it  for  the  moment. 

The  normal  implementation  of  Gaussian  elimination  is  given 
by  (using  a  self-explanatory  "Pidgin"  PL/1): 


do  k  =  1  to  n; 

do  i  =  k  +  to  n; 

aik  "  aik//akk5 

end ; 

do  i  =  k  +  1  to  n; 

do  j  =  k  +  1  to 


.  .  =  a  •  •  -  a . ,  *  a.  • } 
ij  ij  ik  kj* 


end ; 


end ; 


end ; 


It  is  easy  to  verify  that  the  outer  loop  on  k 

,  . (k-1) 

from  A 


computes 


.IK) 


t  h 

Let  us  denote  the  i  row  of  a  matrix  A  =  (a.  •)  by  a. 

1  J  1  * 

and  the  j*'*1  column  by  a*j’  Doolittle  method  of 

Gaussian  elimination  is  obtained  by  noting  that,  in  the  above 
algorithm  a^^  =  a^£  1'>  =  •••  =  a^^  ,  and  computing 
column  by  column.  This  yields  the  following  algorithm. 


•V 


do  k  =  1  to  n; 

do  j  =  1  to  k-1; 

do  i  =  j  +  1  to  n; 


a .  ■  =  a .  .  -  a . .  *  a,  . ; 

l  j  l  j  ik  kj 


end ; 


do  i  =  k  +  1  to  n; 


aik  =  ai)/akk5 


end; 


It  i.s  a  modified  version  of  this  latter  algorithm  that  we  shall 
implement.  Note  that  the  n  -  j  computations  a^ j  =  a^^  - 
a^R  *  a^j  of  the  inner  loop  above  may  be  performed  simultaneously 
(and  this  is  what,  in  essence,  we  shall  do).  The  above  algorithm 
can  then  be  conveniently  implemented,  in  a  parallel  fashion,  on 
our  array-processor  as  follows:  We  compute  the  columns  success¬ 
ively.  To  compute  the  k*"*1  column,  the  elements  ai^  » a2k^  *  *  *  *  » 
a^k^  are  fed  in  successively,  one  per  tick,  from  the  left. 

When  element  a\_^  (  =  aix^  enters  it  is  held  in  a  special 

register.  One  tick  later  it  is  sent  to  P^ •  During  the  same 
tick  a^k ^  enters  P1,  a^^  as  brought  out  of  SSM1  and 

=  a2^  "  a21^alk^  ;'‘s  comPute<3  i-n  Pj*  During  the  next 
tick  the  following  operations  are  performed: 


l  K.  J 

(i)  a^  moves  from  to  P^, 


(ii)  a^k  moves  from  P1  to  P^,  where  it  is  held  for 
later  use,  and 

(iii)  a3^  =  a3k">  ~  a31^alk^  is  comPuted  in  Pj*  This 
continues  through  the  first  k-1  columns.  As  the  elements 


*-  v 


a.^  ,  i  <  k  enter  they  are  passed  into  SSM^.  When  akk 

enters,  it  is  passed  into  SSM^,  and  its  reciprocal  is  calculated 

in  the  reciprocal  unit  and  saved  in  P  .  When  the  elements 

K. 

ajX_1\  i  <  k  <  n,  enter  P^_,  the  element  ajj^  =  ajj^  = 
a^X_  1  ^  •  (  l/al^  )  is  computed  and  sent  to  SSM.  .  Note  that  in 
the  tick  following  computation  of  a^^  =  a^^  -  a^^alj^  in 
P  ,  a,  enters  P  and  the  above  sequence  of  computations 

is  repeated  (with  the  appropriate  changes)  for  the  (k+l)st 
column . 

If  the  ticks  are  numbered  so  that  a^^  enters  C.^  at 

tick  1,  then  enters  P1  at  tick  n(k-l)  +  1  and  a^j^ 

2  ( 1 ) 

enters  P.  at  time  (n+l)(k-l)  +  1.  At  tick  n  ,  a„„  is 

k.  nn 

2 

computed  in  P^,  and  at  time  n  +1  we  may  start  outputting 

( n  )  .2 

the  first  column  of  A  .  At  tick  n  +  n  +  1,  all  entries  of 
A^n^  have  been  computed. 

Let  us  now  investigate  the  changes  necessary  in  order  to 

allow  partial  pivoting.  Recall  that  this  means  that  when  com- 

putina  A^X^  from  A^X  ^  the  largest  element  in  column  k 

is  chosen  as  a  pivot  element.  Specifically,  suppose  that  after 
( k- 1 ) 

A  has  been  chosen,  p^.  is  chosen,  k  <  <  n,  so  that 

|a^X1,1^|  =  max  |a^X  1^|«  Then  a  permutation  t.  of  1,2,...  ,r 
pkK  k<i<n  lK  X 

is  chosen  which  fixes  l,2,...,k-l,  replaces  k  by  p^,  and 

is  otherwise  arbitrary.  Two  convenient  choices  are  =  (k,pk) 

(i.e.  k  and  p^  are  interchanged)  and  =  ( k,k+ 1 , . . . , p^) 

(i.e.  pk+1  replaces  k,  and  k,k+ 1 , . . . ,k+p^-l  each  replace 

their  successors).  Let  be  the  permutation  matrix  which,  by 

left  multiplication,  permutes  the  rows  of  Av  according  to 
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the  permutation  r^.  Then  before  computing  A^,  A^*  is 

( k—  1 ) 

replaced  by  cj^A  and  computation  proceeds  as  before.  Put 

SR  =  dkd}t_1*  •  and  •  •  »r ^ .  if  we  were  omniscient 

and  knew  in  advance  the  permutations  '^2' *  '  '  ,Trn'  we  could 
apply  the  above  algorithm  to  S^A.  Alternatively,  still  omni¬ 
scient,  we  use  the  structure  of  figure  5  (which  is  simply  a 
convenient  renaming  of  our  array-processor,  with  the  first 
two  processors  and  SSM's  combined  to  form  Mq ) •  Using  this 
structure  we  could  attempt  to  perform  Gaussian  elimination  using 
the  following  scheme.  The  elements  of  A^0^  would  enter  Mq 

and  then  follow  the  same  route  as  in  the  earlier  structure. 

t  h 

However  as  the  k  column  a*k  enters  and  leaves  Mq,  it 

would  be  replaced  by  Ska*k’  as  passes  through  P^,  we 

would  find  the  j  column  permuted  by  T  ;  i.e.,  M.  would 

K  J 

contain  S,  T  .  and  thus  be  ready  for  column  k,  and  as  column 
K  J 

k  passed  by  we  would  apply  rk+1  to  the  jth  column. 

Unfortunately,  the  above  method  is  not  feasible,  for  we  do 
•  ,  ( k- 1 ) 

not  know  the  permutation  r,  until  all  of  a'-  is  computed  . 

However  a  modified  version  can  be  used,  and  we  proceed  to  describe 
it . 

i.  y. 

Assume  for  the  moment  that  k  >  4.  As  above,  the  k 
column  enters  and  leaves  MqJ  (as  we  shall  shortly  see)  at  the 
time  it  enters  Mq,  permutation  sk_3  has  been  determined,  but 

Sk-2  and  Sk-1  are  not  yet  ^nown*  So  we  use  to  apP-*-y 
the  permutation  T^_3  to  obtaining  sx-3a*k^ *  As  before 

the  successive  elements  of  sx-3a*k^  are  sent  successively 

through  P]_»P2*  "  ’  *  ,Pk-3*  As  t^1^s  occurs,  Tk-2’  now  knowri»  is 
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applied  to  these  columns.  The  contents  of  SSMj,  1  <  k  <  k-3, 

are  at  this  time,  as  we  shall  arrange,  1  <  k  <  k-3, 

respectively.  Thus  they  are  in  the  same  order  as  S^^a^ 

and  we  may  calculate  Sk_ga^k  and  send  it  directly  to 

( k-  3 ) 

(not  Px-3^}  at  t^ie  same  ti™6  we  may  send  S^_^a^  from 

(k-3) 

P^._3  to  P^,  where  we  compute  sx-2a*k  »  an<^  pass 
appropriately  permuted  into  M^.  The  important  consideration, 
easily  verified  by  computing  times,  is  that  r^_2  is  currently 

available  and  that,  currently,  in  p^_2  we  are  computing 

(k-2)  (k-2) 

S  oal  v  4  and  in  i  we  are  computing  S  „a'  .  ' . 

K""Z  "  f  K~  Z  K—  1  K” Z  *  f  K—  J. 

Now  the  following  method  is  employed:  As  we  compute 
(k-2) 

S,  oaI  v  i  we  1°0}<  for  its  pivot  entry.  We  do  this  by  holding 

K- Z  *  y  _L 

S  t 

the  (k-1)  entry  as  it  comes  by.  If  a  larger  entry,  say 

the  Ith  appears,  we  give  what  was  previously  entry  (k-1)  the 

sfc 

index  Z  and  store  it,  replacing  the  (k-1)  entry  by  the 

Z  .  We  say  tentatively  that  Tk_1=(k-1 , Z ) .  Exactly  the  same 

interchanges  are  performed  on  the  entries  entering  SSMk_2  and 

tl"l 

SSM^.  Thus,  when  the  k  column  is  completely  loaded  into 

SSM^,  SSMk_2  contains  S^_ £n^._ 2 »  ssmk-1  contains 

Sk_^a^  k_^,  SSM]<  contains  k  ;  the  pivot  entry  is 

known  for  column  k-1  and  r  is  known  (to  be  sent  to  MQ 

and  P1,P2»...,Pk  ^) .  We  are  now  ready  to  compute  SK_ia*  k  . 

This  requires  two  processors  and  we  use  p^-_2  (which  is  free, 

for  the  column  that  would  normally  go  to  that  processor  is  being 

( k—  1 ) 

sent  to  ?k+1)  and  pk*  sx-l*  k  is  comPute<^  and,  as  it  is 
«  ,  ( k- 1 ) 

stored  into  M^,  it  becomes  S^a^  k_^.  At  the  same  time,  the 
pivot  element  and  are  obtained  simultaneously. 


It 
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Let  us  summarize  the  timing  aspects  of  the  above  by  noting 
at  what  tick  the  first  element  of  column  k  arrives  at  various 
places  in  the  array-processor. 


Tick 

First  entry 

arrives  at 

Current  value  of  column  k 

n ( k- 1 ) + 1 

Mo 

*  k 

nk  +  1 

P1 

Sk-3Alk) 

nk+k-4 

Pk-4 

S  -,A(k~5) 
k-3  *,k 

nk+k-3 

Pk 

S  A(k"4^ 

k-3  *,k 

n  ( k+  1 )  +  k-  3 

Pk-2 

S  A(k_3) 

k-lA*,k 

n ( k+ 1 )+k-2 

P. 

S  A(k"2) 

k 

k-1  *,k 

n ( k+  2 )+k-l 

Pk 

S  A(k_1) 

k-1  *,k 

n  ( k+2  )+k 

Pk 

S  A^  k  ^ 
k  *,k 

Finally  we  note  that  the  above  procedure  requires  the 
first  three  columns  to  be  appropriately  initialized;  we  omit 
the  description  of  this  straight-forward  procedure. 

5 .  Sparse  Matrices 

We  now  indicate  how  the  above  methods  are  used  for  sparse 
matrices.  We  use  the  following  word  format* 


0 

F 

F_ 

— 

F, 

column 

value 

1 

2 

3 

Here  the  0  is  the  leading  zero  required  for  valid  data,  F1, 
F2  »F^  are  flag  bits.  "Row"  and  "column"  are  20  bit  fields, 
allowing  for  matrices  of  size  (22^-l)  by  (2  -1)  and  the  value 

field  might  be  84  bits  f or  a  word  length  of  128  bits. 


/ 
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We  will  allow  several  columns  of  a  matrix  to  be  in  one  SSM, 

but  will  not  split  a  column  between  two  SSM's.  We  assume  the 

processor  has  one  special  register  for  each  column  held  in  its 

SSM  (this  can  be  avoided  at  the  expense  of  essentially  halving 

the  speed  of  the  array-processor).  Gaussian  elimination  proceeds 

th 

as  before,  except  that  as  the  k  column  passes  by  the  proces¬ 
sors,  several  updates  of  an  element  may  occur  (i.e.  computations 
of  the  form  a .  •  «-  a .  .  -  a . ,  •  a.  . )  for  several  columns  may  re- 
side  in  the  corresponding  SSM.  They  will  be  computed  consecu¬ 
tively  (for  elements  with  the  same  row  number  will  be  adjacent). 
I’ll  is  system,  while  it  may  operate  synchronously  at  the  bit  level, 
should  operate  asynchronously  at  the  word  level.  For  whenever 
all  updates  that  can  occur  in  a  processor  have  occurred,  the 
entry  should  be  immediately  sent  to  the  next  processor,  if  that 
processor  is  not  busy.  To  speed  up  the  procedure  even  more  it 
would  be  appropriate  to  have  first-in,  first-out  buffers  between 
processors  (these  could  be  additional  SSM’s).  Then  a  processor 

could  send  an  entry  on,  even  if  the  next  processor  is  busy.  Of 

th 

course  when  the  k  column  is  being  updated,  columns  k-3,  k-2, 
k-1,  k  will  be  kept  in  distinct  SSM's  (so  that  the  speed-up 
trick  described  in  the  last  section  may  be  used).  But  as  the 
(k+l)st  column  comes  by,  the  (k-3)th  column  will  be  moved  to 
the  preceding  SSM,  if  there  is  room.  If  that  occurs  each  of  the 
columns  k-2,  k-1,  k  will  be  moved  to  the  preceding  SSM. 

We  note  that  many  variants  of  the  above  algorithms  are 
possible,  and  much  further  investigation  and  simulation  is 
necessary  to  determine  which  is  best. 
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In  addition,  it  has  been  tacitly  assumed  that  the  main 
computer,  to  which  our  array-processor  has  been  attached,  will 
initially  permute  the  rows  and  columns  of  the  matrix  to  make  it 
close  to  triangular  or  bounded  so  as  to  minimize  "fill-in”. 
Standard  algorithms  for  this  purpose  may  also  be  performed  in  our 
array-processor  (for  example,  during  partial  pivoting  it  may  be 
better  to  choose  as  a  pivot  element,  not  the  largest,  but  one 
which  is  not  too  small,  but  whose  row  is  sparse). 

Similar  considerations  apply  to  other  standard  algorithms 
in  matrix  theory. 


REFERENCES 

1.  P.  Armstrong  and  M.  Rem,  A  serial  sorting  machine,  Journal 
of  computers  and  Electrical  Engineering  (to  appear). 

2.  l.S.  Duff  and  G.  W.  Stewart,  Sparse  matrix  proceedings, 

1978,  SIAM  (1979),  ISBN  0-89871-160-6. 

3.  D.  Heller,  A  survey  of  parallel  algorithms  in  numerical 
linear  algebra,  SIAM  Review  20  (1978),  p.  740-777. 

4.  A.  Jenninqs,  Matrix  computation  for  engineers  and  scientists, 
Wiley  (1977),  ISBN  0-471-99421-9,  p.  250-278. 

5.  A.  K.  Jones  and  P.  Schwarz,  Experience  using  multiprocessor 
systems  --  A  status  report,  ACM  Computing  Surveys  12  (1980), 
p.  121-165. 

6.  D.  Knuth,  The  art  of  computer  programming  V.  3  —  Sorting 
and  searching,  Addi son-Wesley  (1973),  ISBN  0-201-03803-X . 

7.  H.  T.  Hung  and  C.  Leiserson,  Systolic  arrays  (for  VLSI), 
in  [ 2 ] ,  p.  256-282 . 


t 


-22- 


8.  C.  Mead  and  L.  Conway,  Introduction  to  VLSI  systems, 
Addison-Wesley  (1980),  ISBN  0-201-04358-0. 

9.  W.  Orchard-Hays,  Advanced  linear  programming  techniques, 
McGraw-Hill  (1968). 

10.  D.  J.  Rose  and  R.  A.  Willoughby,  Sparse  matrices  and  their 
applications  --  proceedings  of  a  symposium.  Plenum  ( 1972 ) , 
ISBN  0-306-30587-9 

11.  H.  S.  Stone,  Parallel  processing  with  the  perfect  shuffle, 
IEEE  Trans,  on  Comp,  c-20  (1971),  p.  153-161. 

12.  J.  A.  Tomlin,  Modifying  triangular  factors  of  the  basis  in 
the  simplex  method,  in  [10],  p.  77-85. 


*  r 


F  i  ciure 


